multi-node distributed training, submitit & composer integration demo #1753

YilunKuang · 2022-11-24T01:14:16Z

What does this PR do?

This PR adds a Jupyter notebook demo of how to submit a multi-node distributed training job on a SLURM cluster without using the composer launcher. Specifically, it shows how to use submitit with composer in the multi-node distributed training setting. submitit automates SLURM job submission from Python, with no shell scripts required, and this implementation shows how to properly set up the distributed training environment variables in Python.
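As a rough illustration of the idea (a minimal sketch, not the notebook's actual code), each SLURM task can read its position from `submitit.JobEnvironment` and export the `torch.distributed`-style variables before the trainer is built. The helper names `build_dist_env`, `train_task`, and `submit`, the log folder, the port `29500`, and the resource values are all assumptions made for this example; the variable names are the ones Composer's launcher is documented to set.

```python
import os


def build_dist_env(global_rank, local_rank, node_rank, num_nodes,
                   tasks_per_node, master_addr, master_port=29500):
    """Return the distributed-training variables for a single process.

    Hypothetical helper: the port default is an arbitrary choice.
    """
    return {
        "RANK": str(global_rank),
        "LOCAL_RANK": str(local_rank),
        "NODE_RANK": str(node_rank),
        "WORLD_SIZE": str(num_nodes * tasks_per_node),
        "LOCAL_WORLD_SIZE": str(tasks_per_node),
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


def train_task():
    """Runs once per SLURM task (one task per GPU is assumed)."""
    import submitit  # imported lazily so the helper above works without it

    job_env = submitit.JobEnvironment()
    os.environ.update(
        build_dist_env(
            global_rank=job_env.global_rank,
            local_rank=job_env.local_rank,
            node_rank=job_env.node,
            num_nodes=job_env.num_nodes,
            tasks_per_node=job_env.num_tasks // job_env.num_nodes,
            master_addr=job_env.hostnames[0],  # first node acts as the master
        )
    )
    # Build the composer Trainer and call fit() here (omitted in this sketch).


def submit():
    """Submit the job from Python; resource values are illustrative only."""
    import submitit

    executor = submitit.AutoExecutor(folder="submitit_logs")
    executor.update_parameters(nodes=2, tasks_per_node=4, gpus_per_node=4,
                               timeout_min=60)
    return executor.submit(train_task)
```

Calling `submit()` on a login node would queue the job; the pure `build_dist_env` helper is kept separate so the variable mapping can be checked without a cluster.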

What issue(s) does this change relate to?

It relates to several requests posted on the MosaicML Slack channel.

Before submitting

[Yes] Have you read the contributor guidelines?
[Yes] Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
Did you update any related docs and document your change?
Did you update any related tests and add any new tests related to your change? (see testing)
Did you run the tests locally to make sure they pass?
Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

review-notebook-app · 2022-11-24T01:14:20Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

mvpatel2000 · 2022-12-12T18:15:55Z

@YilunKuang thanks for opening this PR! Just wanted to confirm this is ready for review / we can take a look at it?

CC: @dakinggg @bandish-shah

YilunKuang · 2022-12-12T18:27:59Z

@mvpatel2000 Yes, this is ready for review. The only change is a new Jupyter notebook that serves as a guide to distributed training with composer without the composer launcher; I didn't change the rest of the codebase.

If this guide is accepted, I would recommend adding a reference to this guide in the Distributed Training doc https://docs.mosaicml.com/en/stable/notes/distributed_training.html

bandish-shah · 2022-12-13T01:32:06Z

Hi @YilunKuang, this is great, thanks for submitting this PR with an example for submitit. Unfortunately, I think we'll have to exclude this specific notebook from our typical CI tests since we don't have any SLURM infrastructure set up to test with.

Have you tested this notebook on your end? If so, could you please post some screenshots of the output a user is expected to see in the description of this PR? We can use that as a stopgap for any users interested in this example, as well as evidence that the notebook code is functional, at least at the time of this submission.

YilunKuang · 2022-12-13T01:55:21Z

@bandish-shah Sure no problem. I just thought this might be helpful for someone else working with the SLURM system. Here is a screenshot of the output:

These outputs are not generated directly from the Jupyter notebook but are generated by running the last code block in the Jupyter notebook as a python file.

Also, I ran my script in a multi-node setting (2 nodes, each with 4 GPUs) so there are in total 8 output files from SLURM. Let me know and I can send over the whole zip file containing output from 8 GPUs.
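For reference, the 8 outputs correspond to one process per GPU. Below is a small sketch of how the ranks would be laid out, assuming node-major rank assignment (an assumption here, not something stated in this thread); `rank_layout` is a hypothetical helper and is not part of the notebook.

```python
def rank_layout(num_nodes, gpus_per_node):
    """List (node_rank, local_rank, global_rank) for every process.

    Assumes one process per GPU and node-major global rank numbering.
    """
    return [
        (node, local, node * gpus_per_node + local)
        for node in range(num_nodes)
        for local in range(gpus_per_node)
    ]


# 2 nodes with 4 GPUs each, as in the run described above.
layout = rank_layout(num_nodes=2, gpus_per_node=4)
for node, local, global_rank in layout:
    print(f"node={node} local_rank={local} global_rank={global_rank}")
```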

bandish-shah · 2022-12-13T02:16:30Z

@YilunKuang it's very helpful as we've had others ask questions around how to use Composer with SLURM. Thank you again for doing this!

bandish-shah · 2022-12-13T02:29:07Z

@YilunKuang it looks like there are a few minor CI issues that need to be addressed before we can let this through.

1. Failing code quality checks. This can likely be addressed by running the pre-commit hooks locally: https://github.com/mosaicml/composer/blob/dev/STYLE_GUIDE.md#12-pre-commit-hooks
2. Failing doctest. The example needs to be added manually to the Tutorials toctree, please see: https://github.com/mosaicml/composer/blob/dev/docs/source/index.rst#L73

YilunKuang · 2022-12-13T02:35:16Z

@bandish-shah Glad to help!

For 1, it seems I cannot access the Notion page. I got a message saying "You do not have access to MosaicML. Please contact an admin to add you as a member".

For 2, thanks, I will try to add the additional changes to the pull request in the next two days.

bandish-shah · 2022-12-13T02:38:49Z

Typo on my end, meant to post a link to our style guide which details how to run pre-commit hooks locally, not a notion page 😅. URL fixed in the original comment.

YilunKuang · 2022-12-15T21:18:38Z

@bandish-shah Gotcha thanks! Will push my changes this weekend.

YilunKuang · 2022-12-19T20:10:50Z

@bandish-shah I've addressed the two things you mentioned! My last commit (db9a8b9) seems to pass all the tests, but after I manually merged the latest changes from the dev branch in commit 0c3aa3e, something failed. Can you point me to what I should do about it? Thanks!

dakinggg · 2022-12-19T21:25:21Z

@YilunKuang Looks like it was a transient error :) I merged dev again and it looks like everything is passing, thanks!

YilunKuang · 2022-12-19T21:59:28Z

@dakinggg Thanks for looking into it! Feel free to close this pull request.

YilunKuang added 2 commits November 23, 2022 17:33

jupyter demo of composer with submitit

235ef1c

jupyter demo of composer with submitit

1415393

mvpatel2000 requested review from bcui19 and dakinggg December 12, 2022 18:28

dakinggg requested a review from bandish-shah December 12, 2022 18:32

Merge branch 'dev' into multinode_submitit_demo

6df4999

YilunKuang added 4 commits December 19, 2022 10:33

pre-commit hooks and doc check

e10e2f6

update file

03b4c99

Merge branch 'multinode_submitit_demo' of https://github.com/YilunKua…

db9a8b9

…ng/composer into multinode_submitit_demo

Merge branch 'dev' into multinode_submitit_demo

0c3aa3e

Merge branch 'dev' into multinode_submitit_demo

4ac7508

dakinggg and others added 4 commits December 19, 2022 14:01

Merge branch 'dev' into multinode_submitit_demo

d338078

Merge branch 'dev' into multinode_submitit_demo

19b607b

Merge branch 'dev' into multinode_submitit_demo

d03f769

typos, minor fixes, and attribution to the author

8e40d5f

dakinggg enabled auto-merge (squash) January 9, 2023 23:28

dakinggg approved these changes Jan 9, 2023

dakinggg merged commit ab0ae48 into mosaicml:dev Jan 9, 2023

YilunKuang deleted the multinode_submitit_demo branch January 10, 2023 00:33
